Systematic Characterizations of Text Similarity in Full Text Biomedical Publications
Authors
Abstract
BACKGROUND Computational methods have been used to find duplicate biomedical publications in MEDLINE. Full text articles are becoming increasingly available, yet the similarities among them have not been systematically studied. Here, we quantitatively investigated the full text similarity of biomedical publications in PubMed Central.
METHODOLOGY/PRINCIPAL FINDINGS 72,011 full text articles from PubMed Central (PMC) were parsed to generate three different datasets: full texts, sections, and paragraphs. Text similarity comparisons were performed on these datasets using the text similarity algorithm eTBLAST. We measured the frequency of similar text pairs and compared it among different datasets. We found that high abstract similarity can be used to predict high full text similarity with a specificity of 20.1% (95% CI [17.3%, 23.1%]) and sensitivity of 99.999%. Abstract similarity and full text similarity have a moderate correlation (Pearson correlation coefficient: -0.423) when the similarity ratio is above 0.4. Among pairs of articles in PMC, method sections are found to be the most repetitive (frequency of similar pairs, methods: 0.029, introduction: 0.0076, results: 0.0043). In contrast, among a set of manually verified duplicate articles, results are the most repetitive sections (frequency of similar pairs, results: 0.94, methods: 0.89, introduction: 0.82). Repetition of introduction and methods sections is more likely to be committed by the same authors (odds of a highly similar pair having at least one shared author, introduction: 2.31, methods: 1.83, results: 1.03). There is also significantly more similarity in pairs of review articles than in pairs containing one review and one nonreview paper (frequency of similar pairs: 0.0167 and 0.0023, respectively).
CONCLUSION/SIGNIFICANCE While quantifying abstract similarity is an effective approach for finding duplicate citations, a comprehensive full text analysis is necessary to uncover all potential duplicate citations in the scientific literature and is helpful when establishing ethical guidelines for scientific publications.
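The abstract reports sensitivity, specificity, and a Pearson correlation between abstract similarity and full text similarity. The following is a minimal sketch of how such figures could be computed from a list of scored article pairs. It is not the authors' pipeline and does not call eTBLAST: the function names, the toy scores, and the 0.5 cutoffs are all assumptions chosen for illustration.

```python
from math import sqrt


def confusion_counts(pairs, abstract_cutoff, fulltext_cutoff):
    """Count TP, FP, TN, FN for 'abstract similarity predicts full text similarity'.

    pairs: iterable of (abstract_similarity, fulltext_similarity) scores.
    Both cutoffs are hypothetical; the source does not state them here.
    """
    tp = fp = tn = fn = 0
    for abstract_sim, fulltext_sim in pairs:
        predicted = abstract_sim >= abstract_cutoff   # prediction from the abstract
        actual = fulltext_sim >= fulltext_cutoff      # reference label from the full text
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Fraction of truly similar full texts that the abstract score flags."""
    return tp / (tp + fn) if (tp + fn) else float("nan")


def specificity(tn, fp):
    """Fraction of dissimilar full texts that the abstract score leaves unflagged."""
    return tn / (tn + fp) if (tn + fp) else float("nan")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up scores for five article pairs: (abstract, full text).
    toy_pairs = [(0.9, 0.8), (0.7, 0.2), (0.1, 0.9), (0.2, 0.1), (0.3, 0.05)]
    tp, fp, tn, fn = confusion_counts(toy_pairs, 0.5, 0.5)
    print("sensitivity:", sensitivity(tp, fn))
    print("specificity:", specificity(tn, fp))
    print("pearson:", pearson([a for a, _ in toy_pairs], [f for _, f in toy_pairs]))
```

Keeping the confusion counts separate from the two ratios makes it easy to vary the cutoffs and see how the trade-off between sensitivity and specificity shifts.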
Similar resources
Dysphagia Improvement Using Acupuncture Therapy: A Systematic Review
Background. Dysphagia is a common complication in patients with stroke. The research on acupuncture treatment of dysphagia has increased, but the results are not consistent. In this review we intend to answer “what is the potential of acupuncture in treating dysphagia in stroke patients and which acupuncture points are the most promising for treating dysphagia?” Methods. This systematic review...
Future competencies for hospital management in developing countries: Systematic review
Background: This was a systematic review presenting the future competencies for hospital managers. Methods: Participants, interventions, comparisons and outcomes (PICO) strategy with MeSH terms were used for searching. Databases used were Web of Science, PsycINFO and Medline, EBSCO, ScienceDirect, Emerald, ProQuest, Social Sciences Research Network, Embase, and some Iranian database su...
Semantics-based Text Mining of Biomedical Concepts in
Searching publications for prior work on scientific concepts is central to the research process. The relevant parts of retrieved publications are typically found and evaluated manually. In the field of biomedicine, due to rapidly growing numbers of publications and the lack of standard scientific terminologies, this task is particularly challenging, complex and time consuming. Prior information...
Distribution of information in biomedical abstracts and full-text publications
MOTIVATION Full-text documents potentially hold more information than their abstracts, but require more resources for processing. We investigated the added value of full text over abstracts in terms of information content and occurrences of gene symbol--gene name combinations that can resolve gene-symbol ambiguity. RESULTS We analyzed a set of 3902 biomedical full-text articles. Different key...
Figure Text Extraction in Biomedical Literature
BACKGROUND Figures are ubiquitous in biomedical full-text articles, and they represent important biomedical knowledge. However, the sheer volume of biomedical publications has made it necessary to develop computational approaches for accessing figures. Therefore, we are developing the Biomedical Figure Search engine (http://figuresearch.askHERMES.org) to allow bioscientists to access figures ef...